MONTE CARLO APPROACH FOR RELIABILITY ESTIMATIONS
IN GENERALIZABILITY STUDIES



OBJECTIVES 

The purpose of the present paper is to propose a Monte Carlo approach, using the SAS
programming language, for the estimation of reliability coefficients in generalizability theory studies.
Test scores are generated by a probabilistic model of the type $P_{ip} = F(\theta_p, b_i)$, where $P_{ip}$ is the
probability that person p with ability score $\theta_p$ answers correctly item i with difficulty parameter
$b_i$. Three types of reliability-like coefficients available in generalizability theory are
considered: (a) the generalizability coefficient, Gp, which is the analog of the classical reliability
coefficient, (b) the index of dependability, $\Phi$, which is appropriate for criterion-referenced testing,
and (c) the classification reliability index, $\Phi(\lambda)$, for the reliability of classification decisions
based on a cutting score, $\lambda$ (Brennan and Kane, 1977). The Monte Carlo approach is illustrated here for a
single-facet crossed design, but it works for higher levels of crossed and/or nested generalizability
designs. As designed, the SAS program allows flexibility and control over factors such as:
(a) the type of the probabilistic model $F(\theta_p, b_i)$, (b) the type of the $\theta_p$ and $b_i$ distributions,
(c) the location of the cutting score, $\lambda$, and (d) the amount of information provided by each item, $I_i(\theta)$.

FRAMEWORK 

In the classical approach to measurement, under the attempt to standardize measurement
conditions, there are only two sources of score variation: variance across people (true variance)
and variance from unspecified sources (error variance). In the generalizability theory approach to
measurement it is possible to assess multiple sources of variation. For example, in addition to
items, raters may be involved as a measurement condition which is permitted to vary. The
variance across the object of measurement is called true variance and the variance associated
with all other sources is called error variance. The sources of measurement error are referred to
as facets of measurement. The process through which measurement scores are gathered may be
crossed or nested. In a generalizability theory study (G-study), the purpose is to estimate the
variances associated with various facets of measurement. Further, the consequences of various changes
in the measurement design can be investigated to seek the optimal design. For example, one may
investigate the reduction of error that results from treating some of the facets as fixed. The
investigations of various designs are referred to as decision studies (D-studies). In classical test
theory all facets are generally assumed standardized and the only facet is items. This is analogous
to a G-study in which the concern is with the variance of the total score.

The major statistical difference between a crossed design and a nested design is that the
main effect variance component due to the nested facet is not separable statistically from the
interaction variance component through ANOVA. This fact leads to more conservative
generalizability coefficients in the D-studies than their crossed design counterparts. For an i(p)
design, items (i) nested within persons (p), the variance of all observed scores, $X_{ip}$, is given by:

$$\sigma^2(X_{ip}) = \sigma^2_p + \sigma^2_{i(p)} \tag{1}$$

For a single-facet crossed design p x i, where persons are crossed with items, the variance of the 
observed scores, Xip, is given by: 



$$\sigma^2(X_{ip}) = \sigma^2_p + \sigma^2_i + \sigma^2_{ip} \tag{2}$$
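
A standard generalizability-theory result, added here as a reading aid rather than taken from the derivation above, makes the comparison of (1) and (2) explicit. When items are nested within persons, the item main effect and the person-by-item interaction of the crossed design are confounded in a single component:

$$\sigma^2_{i(p)} = \sigma^2_i + \sigma^2_{ip}$$

so both designs partition the same total variance, but only the crossed design estimates $\sigma^2_i$ and $\sigma^2_{ip}$ separately.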



Related to the reliability coefficients under consideration in this paper, Gp, $\Phi$, and $\Phi(\lambda)$,
are the concepts of absolute error variance, $\sigma^2(\Delta)$, and relative error variance, $\sigma^2(\delta)$. The
absolute error is the difference between a person's observed and universe scores, $\Delta_p = X_{ip} - \mu_p$,
and the relative error is the difference between a person's observed deviation score and his/her
universe deviation score, $\delta_p = (X_{ip} - \mu_i) - (\mu_p - \mu)$.



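If $\sigma^2(\tau)$ stands for the variance of the object of measurement, $\tau$, in a G-study, the
reliability coefficients Gp and $\Phi$ are defined as follows:

$$\mathrm{Gp} = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)} \tag{3}$$

$$\Phi = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)} \tag{4}$$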
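
As a purely numerical illustration of (3) and (4), assume a p x i design with $n_i = 20$ items and hypothetical variance components $\sigma^2_p = 0.04$, $\sigma^2_i = 0.02$, and $\sigma^2_{ip} = 0.18$. These values are invented for the example and are not results of this study. With persons as the object of measurement, $\sigma^2(\delta) = \sigma^2_{ip}/n_i$ and $\sigma^2(\Delta) = \sigma^2_i/n_i + \sigma^2_{ip}/n_i$, which gives:

$$\sigma^2(\delta) = \frac{0.18}{20} = 0.009, \qquad \sigma^2(\Delta) = \frac{0.02 + 0.18}{20} = 0.010$$

$$\mathrm{Gp} = \frac{0.04}{0.04 + 0.009} \approx 0.816, \qquad \Phi = \frac{0.04}{0.04 + 0.010} = 0.800$$

The item variance component enters only the absolute error, which is why $\Phi$ cannot exceed Gp.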



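If the person's achievement is the only object of measurement, $\tau = p$, then $\sigma^2(\tau) = \sigma^2_p$ and the
classification reliability index is defined by:

$$\Phi(\lambda) = \frac{\sigma^2_p + (\mu - \lambda)^2}{\sigma^2_p + (\mu - \lambda)^2 + \sigma^2(\Delta)} \tag{5}$$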



An unbiased estimate of $(\mu - \lambda)^2$ is $(\bar{X} - \lambda)^2 - \sigma^2(\bar{X})$, where $\bar{X}$ is the mean test score of $n_p$
persons on $n_i$ test items.
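
The correction term can be motivated by a standard identity, sketched here under the assumption that $\bar{X}$ is an unbiased estimate of $\mu$:

$$E\left[(\bar{X} - \lambda)^2\right] = \sigma^2(\bar{X}) + (\mu - \lambda)^2$$

The squared distance between the observed mean and the cut-off therefore overstates $(\mu - \lambda)^2$ on average by the sampling variance of the mean, and subtracting an unbiased estimate of $\sigma^2(\bar{X})$ removes that excess.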

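For example, for the one-facet crossed design "persons x items" the mean score variance is given by:

$$\sigma^2(\bar{X}) = \sigma^2_p/n_p + \sigma^2_i/n_i + \sigma^2_{ip}/(n_p n_i) \tag{6}$$

and the absolute error variance is given by:

$$\sigma^2(\Delta) = \sigma^2_i/n_i + \sigma^2_{ip}/n_i \tag{7}$$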

By taking into account (6) and (7), the following unbiased estimate of the classification reliability 
index can be obtained from (5) for one-facet crossed design: 



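$$\hat{\Phi}(\lambda) = \frac{\sigma^2_p + (\bar{X} - \lambda)^2 - \sigma^2(\bar{X})}{\sigma^2_p + \sigma^2(\Delta) + (\bar{X} - \lambda)^2 - \sigma^2(\bar{X})} \tag{8}$$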
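
The estimate in (8) can be sketched in a few lines of code. The following is a minimal Python illustration, not part of the SAS program described in this paper: the function name and the numerical inputs are hypothetical, and all quantities are assumed to be expressed on the same score scale (here, proportion correct).

```python
def classification_index(var_p, var_i, var_pi, n_p, n_i, xbar, lam):
    """Estimate Phi(lambda) for a one-facet crossed p x i design, as in (8)."""
    var_xbar = var_p / n_p + var_i / n_i + var_pi / (n_p * n_i)  # formula (6)
    var_abs = var_i / n_i + var_pi / n_i                         # formula (7)
    dev2 = (xbar - lam) ** 2 - var_xbar  # estimate of (mu - lambda)^2
    return (var_p + dev2) / (var_p + var_abs + dev2)


# Hypothetical variance components; not results from this study.
phi_at_mean = classification_index(0.04, 0.02, 0.18, 100, 20, xbar=0.50, lam=0.50)
phi_far = classification_index(0.04, 0.02, 0.18, 100, 20, xbar=0.50, lam=0.30)
print(round(phi_at_mean, 3), round(phi_far, 3))
```

With these inputs the index is lower when the cut-off coincides with the observed mean than when it lies away from it, which is the behavior implied by (5).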



METHOD 



A relatively simple SAS program was developed for Monte Carlo simulations and
calculation of the reliability indices Gp, $\Phi$, and $\Phi(\lambda)$ from formulas (3), (4) and (5), respectively.
It assumes an error-free measure of the person's ability, $\theta_p$, on the same scale of measurement as the
item's difficulty, $b_i$, and allows different simulations of the $\theta_p$ and $b_i$ distributions. It also allows
different simulations of the probabilistic model $P_{ip} = F(\theta_p, b_i)$ for calculating the probability that a
person with ability score $\theta_p$ will answer correctly an item with difficulty $b_i$. The user of the SAS
program has the freedom to set different cut-off scores, $\lambda$, for the calculation of $\Phi(\lambda)$, and to
select simulated items that provide information above some specified amount, $I_i(\theta)$.

The results, reported later in this paper, were obtained from Monte Carlo simulations of
different IRT models. For the purposes of unification and comparison, the same number of
replications (NR = 20), the same number of persons (NP = 100), and the same number of items (NI = 20)
were used for each IRT model. Hence, 40,000 observations were manipulated with each execution
of the respective SAS program. The following illustrative models were used:

(A) Two-parameter IRT model: Both the person's ability, $\theta_p$, and the item difficulty, $b_i$, were
taken from the standard normal distribution, by using the SAS function RANNOR, and
substituted in the two-parameter logistic model:

$$P_{ip} = \{1 + \exp[-D a_i (\theta_p - b_i)]\}^{-1} \tag{9}$$

where the item discrimination parameter, $a_i$, is proportional to the slope of the item characteristic curve
(ICC) at the point $b_i$ on the ability scale. The constant D was fixed to 1.7, which made the model
in (9) almost the same as the normal-ogive model (see, e.g., Hambleton et al., 1991, p. 15). The
score, $X_{ip}$, of person p on item i was simulated by the SAS function RANBIN(0, 1, $P_{ip}$). The
ANOVA procedure was used for the statistics involved in the single-facet crossed design.

(B) One-parameter IRT model: The simulations of this model were conducted under the
same conditions as for the two-parameter model in (9), with the difference that the item
discrimination parameter was the same for all items ($a_i = a$):

$$P_{ip} = \{1 + \exp[-1.7 a (\theta_p - b_i)]\}^{-1} \tag{10}$$

The results were compared for three different values of the fixed parameter, a = .1, a = .5, and
a = 1.8, looking for possible changes in the reliability coefficients Gp, $\Phi$, and $\Phi(\lambda)$.

(C) Rasch's Poisson model: The general form of this model (see, e.g., Allen & Yen,
1979, p. 250) was reduced here to the calculation of the probability of person j making no error on
(i.e., giving a correct answer to) an item i:

$$P(X_{ij} = 0 \mid \lambda_{ij}) = \exp(-\lambda_{ij}) \tag{11}$$

where $\lambda_{ij}$ is the Poisson parameter, a function of the item difficulty and the person's ability. The
item difficulty, $b_i$, and the person's ability, $\theta_j$, were generated randomly from the
uniform distribution on the interval (0,1), by using the SAS function UNIFORM.






(D) Chi-square model: This model was used for the illustration of a hypothetical
measurement situation where, for example, the person's ability level is fixed and the probability for
a given dichotomous response, $P_{ij}$, decreases when the product of factors such as the person's anxiety
level, $\theta_j$, and task difficulty, $t_i$, increases. Then, the respective function $P_{ij} = F(\theta_j, t_i)$ may be
approximated, say, by the chi-squared probability distribution function. This hypothetical situation
is illustrated in the current study with the simulation of $P_{ij}$ by the use of the SAS function
PROBCHI for the chi-squared probability distribution, with degrees of freedom fixed to df = 3.

(E) Uniform model: With this model, the probability $P_{ij}$ for person j to answer correctly
item i was generated as a pseudo-random variate uniformly distributed on the interval (0,1), by
using the SAS function UNIFORM.

After determining the $P_{ij}$ value in any of the above models, a dichotomous response of
0 or 1 was generated as an observation from the binomial distribution B(n, x, p), with n = 1, x =
1, and p = $P_{ij}$, by using the SAS function RANBIN. The entire Monte Carlo procedure, with 20
replications for 100 persons and 20 items, was executed for 5 different cut-off score values, $\lambda$, for
the analysis of the classification reliability, $\Phi(\lambda)$ ($\lambda$ = 6, 8, 10, 12, 14). These cut-off scores are
symmetrically located around the theoretical average score, $\mu$ = 10, on the 20-item test for the
uniform and logistic models used in this study. For the illustrative examples, the distribution mean
of the scores generated from the chi-square probability model was close to $\mu$ = 12, and the score
mean for Rasch's Poisson model was close to $\mu$ = 8.
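
The procedure described in this section can be sketched outside SAS as follows. This is a minimal Python illustration, not the program given in the appendix: the function names are hypothetical, item parameters are drawn once per replication and shared by all persons so that the design is strictly crossed, discriminations are drawn from the uniform distribution on (0,1), and the ANOVA is computed on the item-level 0/1 scores.

```python
import math
import random


def simulate_scores(n_p=100, n_i=20, seed=1):
    """Simulate a persons x items matrix of 0/1 scores under a 2PL model."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_p)]  # person abilities
    b = [rng.gauss(0.0, 1.0) for _ in range(n_i)]      # item difficulties
    a = [rng.random() for _ in range(n_i)]             # item discriminations
    scores = []
    for t in theta:
        row = []
        for a_i, b_i in zip(a, b):
            p = 1.0 / (1.0 + math.exp(-1.7 * a_i * (t - b_i)))
            row.append(1 if rng.random() < p else 0)   # Bernoulli draw
        scores.append(row)
    return scores


def variance_components(x):
    """ANOVA estimates of the p, i, and pi components (one score per cell)."""
    n_p, n_i = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in x]
    i_means = [sum(row[j] for row in x) / n_p for j in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_tot = sum((v - grand) ** 2 for row in x for v in row)
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = (ss_tot - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))
    var_p = max((ms_p - ms_pi) / n_i, 0.0)  # negative estimates set to zero
    var_i = max((ms_i - ms_pi) / n_p, 0.0)
    return var_p, var_i, ms_pi


def reliability(var_p, var_i, var_pi, n_i):
    """Return (generalizability coefficient, dependability index)."""
    rel_err = var_pi / n_i
    abs_err = (var_i + var_pi) / n_i
    return var_p / (var_p + rel_err), var_p / (var_p + abs_err)


# 20 replications of 100 persons by 20 items, averaged as in the paper.
estimates = [reliability(*variance_components(simulate_scores(seed=r)), n_i=20)
             for r in range(20)]
mean_g = sum(g for g, _ in estimates) / len(estimates)
mean_phi = sum(f for _, f in estimates) / len(estimates)
print(round(mean_g, 3), round(mean_phi, 3))
```

Because the sketch differs from the SAS program in how item parameters are drawn and in the score analyzed, its output should not be expected to reproduce the values in Table 1.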



RESULTS 



Table 1 summarizes the results from the Monte Carlo simulation of the five models
described above in (A) - (E). A general inspection of the reliability indices reveals consistency of
the results with the respective theoretical relations between them. For instance, the generalizability
coefficients, Gp, are systematically higher than the dependability indices, $\Phi$. This can be seen
from the comparison of formulas (3) and (4), taking into account that $\sigma^2(\Delta) \geq \sigma^2(\delta)$. Also, the
comparison of classification indices, $\Phi(\lambda)$, calculated for different cut-off scores, $\lambda$, confirms the
theoretical fact that the closer $\lambda$ is to the distribution mean, $\mu$, the lower $\Phi(\lambda)$ is (see formula (5)).

The standard deviations of the reliability indices, across their values from the 20 replications, were
very small and varied between 0.0007 and 0.02. All those facts, along with the high values of all
reliability indices, support the validation of the proposed Monte Carlo procedure.

From another perspective, the comparison of Gp and $\Phi$ indicates that the generalizability
coefficient, Gp, is more resistant to the type of probability model. As mentioned at the
beginning, Gp is the analog of the classical reliability coefficient, whereas $\Phi$ is appropriate for
criterion-referenced testing.




In terms of $\Phi(\lambda)$, the Chi-square and the Uniform models are more sensitive to the
position of the cut-off score, $\lambda$, compared to the other three models. Also, the one-parameter
logistic model leads to higher $\Phi(\lambda)$ values, compared to the two-parameter logistic model,
especially when $\lambda$ is close to the population mean score (in this example, $\mu$ = 10 for the two
logistic models).
CONCLUSION

The Monte Carlo approach developed in this study allows the calculation of all reliability
indices, Gp, $\Phi$, and $\Phi(\lambda)$, in G-studies, for different distributions of the person's ability scores, $\theta_p$,
different distributions of item difficulty, $b_i$, and different IRT-type models for the probability of
a correct answer, $P_{ip} = F(\theta_p, b_i)$. One can also incorporate models for the probability of responses
based on personality and/or task parameters other than $\theta_p$ and $b_i$, respectively. For the purpose of
classification reliability estimations, one can vary the position of the cut-off score, $\lambda$, for different
optimizations of the $\Phi(\lambda)$ index.

Technically, it is easy to set restrictive conditions about the amount of information
provided by the simulated items, the range of the person's ability score, and other factors influencing the
quality of the measurement. For example (see, e.g., Hambleton et al., 1991, p. 97), for the
two-parameter logistic model in (9), the amount of item information, $I_i(\theta_j)$, is calculated by a formula
that is easy to incorporate in the SAS program given in the appendix:

$$I_i(\theta_j) = 2.89\, a_i^2\, P_{ij} (1 - P_{ij})$$
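
A restriction of this kind might be sketched as follows. The Python fragment below is only an illustration under stated assumptions: the function names, the item pool, and the information threshold are hypothetical and do not come from the SAS program in the appendix.

```python
import math


def item_information(theta, a, b, d=1.7):
    """Item information for the two-parameter logistic model."""
    p = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
    return (d * a) ** 2 * p * (1.0 - p)  # equals 2.89 * a^2 * P * (1 - P) for d = 1.7


def informative_items(items, theta, min_info):
    """Keep the (a, b) pairs whose information at theta is at least min_info."""
    return [(a, b) for a, b in items if item_information(theta, a, b) >= min_info]


# Hypothetical item pool: (discrimination, difficulty) pairs.
pool = [(0.2, 0.0), (1.0, 0.0), (1.0, 3.0)]
kept = informative_items(pool, theta=0.0, min_info=0.5)
print(kept)
```

Items with low discrimination, or with difficulty far from the ability level of interest, contribute little information and are dropped by such a filter.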



In addition to the applications discussed above, one can use the variance components 
calculated by the SAS procedure (see Appendix) for other important purposes such as error 
reduction, confidence interval calculations, and standard settings in various D-study designs. 
Table 1

Monte Carlo averaged estimates of GT reliability indices from 20
replications of simulated observations on 100 persons and 20 items

Probability        Classification index, $\Phi(\lambda)$, at $\lambda$ =  Dependability    Generalizability
model              6          8          10         12         14         index, $\Phi$    coefficient, Gp

2P logistic        .973       .927       .826       .926       .973       .848             .979
1P logistic        .975       .949       .924       .949       .976       .927             .982
Rasch's Poisson    .958       .929       .961       .983       .991       .932             .982
Chi-square         .984       .968       .912       .634       .786       .682             .975
Uniform            .973       .919       .740       .911       .972       .787             .976

Note. $\lambda$ = cut-off score for classification decisions; 2P logistic = two-parameter logistic model;
1P logistic = one-parameter logistic model.
APPENDIX

SAS PROGRAM FOR MONTE CARLO RELIABILITY ESTIMATIONS IN G-STUDIES
(Illustration version for the two-parameter logistic model)



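DATA CROSSED;
   NR = 20;                 /* number of replications */
   NP = 100;                /* number of persons */
   NI = 20;                 /* number of items */
   NRL1 = 3*NR - 2;         /* row of the last PERS mean square in TWO */
   NRL2 = NRL1 + 1;         /* row of the last ITEM mean square in TWO */
   NRL3 = NRL2 + 1;         /* row of the last PERS*ITEM mean square in TWO */
   DO REP = 1 TO NR;
      DO PERS = 1 TO NP;
         TP = RANNOR(0);               /* person ability */
         SCRP = 0;                     /* running total score of the person */
         DO ITEM = 1 TO NI;
            TI = RANNOR(0);            /* item difficulty */
            A = UNIFORM(0);            /* item discrimination */
            E = -1.7*(TP - TI)*A;
            P = 1/(1 + EXP(E));        /* two-parameter logistic probability */
            SCOR = RANBIN(0, 1, P);    /* dichotomous response */
            SCRP = SCRP + SCOR;
            OUTPUT;
         END;
      END;
   END;
RUN;

PROC SORT DATA=CROSSED; BY REP;
RUN;

PROC ANOVA DATA=CROSSED OUTSTAT=POST;  /* OUTSTAT= saves SS and DF per effect */
   CLASSES PERS ITEM; BY REP;
   MODEL SCRP = PERS ITEM PERS*ITEM / NOUNI;
RUN;

DATA NUMB;
   SET CROSSED (KEEP=NR NP NI NRL1 NRL2 NRL3);
RUN;

DATA SCORE;
   SET CROSSED;
   SCOREPI = SCRP;
RUN;

DATA ADJSCORE;                         /* keeps each person's final total score */
   SET NUMB;
   NPR = NR*NP*NI;
   DO N = NI TO NPR BY NI;
      SET SCORE POINT=N;
      IF _ERROR_=1 THEN ABORT;
      OUTPUT;
   END;
   STOP;
RUN;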




DATA KUKA;
   SET ADJSCORE;
RUN;

PROC SORT DATA=KUKA; BY REP;
RUN;

PROC MEANS DATA=KUKA;                  /* mean total score per replication */
   VAR SCOREPI; BY REP;
   OUTPUT OUT=BARR MEAN=XBAR;
RUN;

DATA TWO;                              /* mean squares; zero-DF rows are dropped */
   SET POST;
   IF DF > 0 THEN MS=SS/DF; ELSE DELETE;
RUN;

DATA TWO1;
   SET TWO;
   MSP=MS;
RUN;

DATA TWO2;
   SET TWO;
   MSI=MS;
RUN;

DATA TWO3;
   SET TWO;
   MSPI=MS;
RUN;

DATA PERS;                             /* rows 1, 4, 7, ...: PERS mean squares */
   SET NUMB;
   DO N = 1 TO NRL1 BY 3;
      SET TWO1 POINT=N;
      IF _ERROR_=1 THEN ABORT;
      OUTPUT;
   END;
   STOP;
RUN;

DATA ITEM;                             /* rows 2, 5, 8, ...: ITEM mean squares */
   SET NUMB;
   DO N = 2 TO NRL2 BY 3;
      SET TWO2 POINT=N;
      IF _ERROR_=1 THEN ABORT;
      OUTPUT;
   END;
   STOP;
RUN;




DATA PERSITEM;                         /* rows 3, 6, 9, ...: PERS*ITEM mean squares */
   SET NUMB;
   DO N = 3 TO NRL3 BY 3;
      SET TWO3 POINT=N;
      IF _ERROR_=1 THEN ABORT;
      OUTPUT;
   END;
   STOP;
RUN;

DATA RESUL;                            /* variance components and indices per replication */
   SET NUMB;
   MERGE PERS ITEM PERSITEM;
   NPI=NP*NI;
   SGMP2=(MSP-MSPI)/NI;
   IF SGMP2 < 0 THEN SGMP2=0;          /* negative estimates are set to zero */
   SGMI2=(MSI-MSPI)/NP;
   IF SGMI2 < 0 THEN SGMI2=0;
   SGMPI2=MSPI;
   SX2=SGMP2/NP + SGMI2/NI + SGMPI2/NPI;       /* formula (6) */
   GR=SGMP2/(SGMP2 + SGMPI2/NI);               /* formula (3) */
   SGMDLT2=SGMI2/NI + SGMPI2/NI;               /* formula (7) */
   FI=SGMP2/(SGMP2 + SGMDLT2);                 /* formula (4) */
RUN;

DATA KP;
   SET RESUL (KEEP=FI GR SX2 SGMDLT2 SGMP2);
RUN;

DATA FNL;
   MERGE BARR KP;
   DO LBD = 4 TO 14 BY 2;                      /* LBD = cut-off score */
      DEV2 = (XBAR - LBD)*(XBAR - LBD) - SX2;
      FIL = (SGMP2 + DEV2)/(SGMP2 + SGMDLT2 + DEV2);   /* formula (8) */
      OUTPUT;
   END;
RUN;

PROC SORT DATA=FNL; BY LBD;
RUN;

PROC MEANS DATA=FNL;                   /* averages over the replications */
   VAR XBAR GR FI FIL; BY LBD;
RUN;



Note: All changes related to the use of different probability models,
different simulations of person's ability and/or item difficulty, and
different restrictive conditions on the model, concern only the first
data set (DATA CROSSED). They should be done by using the respective
SAS manipulations, without changing the other data sets. Changes
related to the cut-off score values concern only the DO command in the
last data set (DATA FNL).



RUN 



RUN 



END 



RUN 
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